124 research outputs found

    Comprehensive synchronization elimination for Java

    In this paper, we describe three novel analyses for eliminating unnecessary synchronization that remove over 70% of dynamic synchronization operations on the majority of our 15 benchmarks and improve the bottom-line performance of three by 37–53%. Our whole-program analyses attack three frequent forms of unnecessary synchronization: thread-local synchronization, reentrant synchronization, and enclosed lock synchronization. We motivate the design of our analyses with a study of the kinds of unnecessary synchronization found in a suite of single- and multi-threaded benchmarks of different sizes and drawn from a variety of domains. We analyze the performance of our optimizations in terms of dynamic operations removed and run-time speedup. We also show that our analyses may enable the use of simpler synchronization models than the model found in Java, at little or no additional cost in execution time. The synchronization optimizations we describe enable programmers to design efficient, reusable, and maintainable libraries and systems in Java without cumbersome manual code restructuring.
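The three forms of unnecessary synchronization named in this abstract can be illustrated with a small Java sketch. The abstract does not define the terms, so the readings below are assumptions: thread-local means the locked object is reachable from only one thread, reentrant means the thread already holds the monitor it re-acquires, and enclosed means a lock is only ever acquired while another lock is held. The class and method names are hypothetical and are not taken from the paper.

```java
public class UnnecessarySyncForms {

    // Thread-local (assumed reading): the StringBuffer never escapes this
    // method, so only the calling thread can lock it. Each synchronized
    // append() call therefore protects against nothing.
    static String threadLocal() {
        StringBuffer sb = new StringBuffer();
        sb.append("a").append("b");
        return sb.toString();
    }

    static class Counter {
        private int n;

        synchronized void add(int d) { n += d; }

        // Reentrant (assumed reading): increment() already holds this
        // object's monitor when it calls add(), so the inner acquisition
        // is redundant.
        synchronized void increment() { add(1); }

        synchronized int get() { return n; }
    }

    static class Account {
        // The Counter is private and never handed out, so its lock is only
        // ever taken by a thread that already holds the Account's lock.
        private final Counter balance = new Counter();

        // Enclosed (assumed reading): the Account lock encloses every
        // acquisition of the Counter lock.
        synchronized void deposit(int amount) { balance.add(amount); }

        synchronized int total() { return balance.get(); }
    }

    public static void main(String[] args) {
        Counter c = new Counter();
        c.increment();
        Account a = new Account();
        a.deposit(5);
        System.out.println(threadLocal() + " " + c.get() + " " + a.total());
    }
}
```

In each case the program stays correct if the marked lock operation is removed, which is the kind of opportunity a whole-program analysis would have to prove safe before eliminating it.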

    Phosphorylation of a splice variant of collapsin response mediator protein 2 in the nucleus of tumour cells links cyclin dependent kinase-5 to oncogenesis

    Background: Cyclin-dependent protein kinase-5 (CDK5) is an unusual member of the CDK family as it is not cell cycle regulated. However, many of its substrates have roles in cell growth and oncogenesis, raising the possibility that CDK5 modulation could have therapeutic benefit. In order to establish whether changes in CDK5 activity are associated with oncogenesis, one could quantify phosphorylation of CDK5 targets in disease tissue in comparison to appropriate controls. However, the identity of physiological and pathophysiological CDK5 substrates remains the subject of debate, making the choice of CDK5 activity biomarkers difficult. Methods: Here we use in vitro and in-cell phosphorylation assays to identify novel features of CDK5 target sequence determinants that confer enhanced CDK5 selectivity, providing means to select substrate biomarkers of CDK5 activity with more confidence. We then characterise tools for the best CDK5 substrate we identified to monitor its phosphorylation in human tissue and use these to interrogate human tumour arrays. Results: The close proximity of Arg/Lys amino acids and a proline two residues N-terminal to the phosphorylated residue both improve recognition of the substrate by CDK5. In contrast, the presence of a proline two residues C-terminal to the target residue dramatically reduces phosphorylation rate. Serine-522 of Collapsin Response Mediator-2 (CRMP2) is a validated CDK5 substrate that meets many of these structural criteria. We generate and characterise phosphospecific antibodies to Ser522 and show that phosphorylation appears in human tumours (lung, breast, and lymphoma) in stark contrast to surrounding non-neoplastic tissue. In lung cancer, the anti-phospho-Ser522 signal is positive in squamous cell carcinoma more frequently than in adenocarcinoma. Finally, we demonstrate that it is a specific and unusual splice variant of CRMP2 (CRMP2A) that is phosphorylated in tumour cells.
    Conclusions: For the first time, these data associate altered CDK5 substrate phosphorylation with oncogenesis in some but not all tumour types, implicating altered CDK5 activity in aspects of pathogenesis. These data identify a novel oncogenic mechanism where CDK5 activation induces CRMP2A phosphorylation in the nuclei of tumour cells.

    The Effectiveness of Multiple Hardware Contexts

    Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread working set can have a negative effect on cache conflict misses. In this paper we evaluate the two phenomena together, examining their combined effect on execution time. The usefulness of multiple hardware contexts depends on program data locality, cache organization, and degree of multiprocessing. Multiple hardware contexts are most effective on programs that have been optimized for data locality. For these programs, execution time dropped with increasing contexts, over widely varying architectures. With unoptimized applications, multiple contexts had limited value. The best performance was seen with only two contexts, and only on uniprocessors and small multiprocessors. The behavior of the unoptimized applications changed more noticeably with..

    Impact of Sharing-Based Thread Placement on Multithreaded Architectures

    Multithreaded architectures context switch to another instruction stream to hide the latency of memory operations. Although the technique improves processor utilization, it can increase cache interference and degrade overall performance. One technique to reduce the interconnect traffic is to co-locate on the same processor threads that share data. The multi-thread sharing in the cache should reduce compulsory and invalidation misses, benefiting execution time. To test this hypothesis, we compared a variety of thread placement algorithms via trace-driven simulation of fourteen coarse- and medium-grain parallel applications on several multithreaded architectures. Our results contradict the hypothesis. Rather than decreasing, compulsory and invalidation misses remained fairly constant across all placement algorithms, for all processor configurations, even with an infinite cache. That is, sharing-based placement had no (positive) effect on execution time. Instead, load balancing was the cr..

    Static Analysis of Barrier Synchronization in Explicitly Parallel Programs

    Many coarse-grained, explicitly parallel programs execute in phases delimited by barriers to preserve sets of cross-process data dependencies. One of the major obstacles to optimizing these programs is the necessity to conservatively assume that any two statements in the program may execute concurrently. Consequently, compilers fail to take advantage of opportunities to apply optimizing transformations, particularly those designed to improve data locality, both within and across the phases of the program. We present a simple and efficient compile-time algorithm that uses the presence of barriers to perform non-concurrency analysis on coarse-grained, explicitly parallel programs. It works by dividing the program into a set of phases and computing the control flow between them. Each phase consists of one or more sequences of program statements that are delimited by barrier synchronization events and can execute concurrently. We show that the algorithm performs perfectly on all but one of..
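The phase idea in this abstract can be sketched for the simplest possible case. The sketch below is not the paper's algorithm: it assumes a straight-line program with no loops or branches, where every process executes the same statement list, and so it skips the control-flow computation between phases that the abstract mentions. The class name, the `BARRIER` marker, and both helper methods are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class BarrierPhases {
    // Hypothetical marker for a barrier statement in the statement list.
    static final String BARRIER = "barrier";

    // Assigns each statement the index of the phase it belongs to.
    // A barrier closes the current phase and is itself marked with -1.
    static int[] phaseOf(List<String> statements) {
        int[] phase = new int[statements.size()];
        int current = 0;
        for (int i = 0; i < statements.size(); i++) {
            if (statements.get(i).equals(BARRIER)) {
                phase[i] = -1;
                current++;
            } else {
                phase[i] = current;
            }
        }
        return phase;
    }

    // In straight-line code, two statements can overlap in time across
    // processes only if no barrier separates them, i.e. same phase.
    static boolean mayRunConcurrently(int[] phase, int a, int b) {
        return phase[a] >= 0 && phase[a] == phase[b];
    }

    public static void main(String[] args) {
        List<String> program = Arrays.asList("x = 1", "y = x", BARRIER, "z = y");
        int[] phase = phaseOf(program);
        System.out.println(Arrays.toString(phase));
        System.out.println(mayRunConcurrently(phase, 0, 1));
        System.out.println(mayRunConcurrently(phase, 0, 3));
    }
}
```

A compiler with this information no longer has to assume that every pair of statements may execute concurrently; pairs in different phases can be treated as ordered.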

    Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations

    We have developed compiler algorithms that analyze coarse-grained, explicitly parallel programs and restructure their shared data to minimize the number of false sharing misses. The algorithms analyze the per-process data accesses to shared data, use this information to pinpoint the data structures that are prone to false sharing and choose an appropriate transformation to reduce it. The algorithms eliminated an average (across the entire workload) of 64% of false sharing misses, and in two programs more than 90%. However, how well the reduction in false sharing misses translated into improved execution time depended heavily on the memory subsystem architecture and previous programmer efforts to optimize for locality. On a multiprocessor with a large cache configuration and high cache miss penalty, the transformations improved the execution time of programmer-unoptimized applications by as much as 60%. However, on programs where previous programmer efforts to improve data locality had ..
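As an illustration of what a data transformation against false sharing can look like, the sketch below pads per-thread counters so that each one sits on its own cache line. Padding is a commonly used technique and is chosen here for illustration only; the abstract does not say which transformations the algorithms apply. The 64-byte line size, the class name, and the methods are assumptions.

```java
public class PaddedCounters {
    // Assumption: 64-byte cache lines, so 8 longs fill one line.
    static final int LONGS_PER_LINE = 8;

    // Maps a thread id to its slot. Slots are a full line apart, so two
    // threads never write to the same cache line.
    static int slot(int threadId) {
        return threadId * LONGS_PER_LINE;
    }

    // Each thread increments only its own padded slot; the per-thread
    // totals are returned once all threads have finished.
    static long[] run(int threads, int iterations) {
        long[] counters = new long[threads * LONGS_PER_LINE];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int idx = slot(t);
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    counters[idx]++;
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // join also makes the worker's writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        long[] totals = new long[threads];
        for (int t = 0; t < threads; t++) {
            totals[t] = counters[slot(t)];
        }
        return totals;
    }

    public static void main(String[] args) {
        long[] totals = run(4, 1_000_000);
        for (long total : totals) {
            System.out.println(total);
        }
    }
}
```

With an unpadded layout (`slot(t) = t`) the result is identical, but neighbouring counters share a line and each write can invalidate the other threads' cached copies. As the abstract notes, how much this matters in execution time depends on the memory subsystem.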